1. Introduction


1.1  General  Problem 

The  network  simulation  engine,  ns-3,  is  a  discrete-event  network  simulator. 
Simulations  are  composed  of  models  that  are  coded  using  the  C++  programming 
language. While some graphical tools exist for creating network topologies (e.g.,
Boston University Representative Internet Topology Generator [BRITE]),1 writing
models and scenarios remains a manual, complex, and time-consuming task.
Models  are  used  to  represent  nodes,  computers,  and  other  networking  components 
such  as  physical  network  interface  cards  (NICs),  network  addresses,  protocols, 
etc. When put together, these models compose a scenario. Many of the
protocol models provided by ns-3 are generic and, when analyzed with
Wireshark,2 produce network traffic that differs greatly from real protocol traffic.
The accuracy of simulation results relies on these models, yet model validation
is lacking. To address these issues, we introduce a tool that takes a traffic
capture of a network and extracts the necessary protocol fields, their vocabulary,
and the protocol's state machine, automating the generation of a precise protocol
model based on the behavior observed in real execution.

1.2  Summary  of  Methodology 

Our  steps  in  developing  the  model  generator  included  researching  existing 
technologies  that  could  extract  fields  from  binaries  or  protocols.  We  identified 
limitations  and  implemented  a  system  that  could  utilize  some  of  these  tools  to 
extract the vocabulary and grammar. We collected 3 network traffic captures. The
first was a capture of Internet Control Message Protocol (ICMP) Ping traffic on a
real network. For the second, we created an ns-3 scenario with 2 nodes; 1 node sent
ICMP Ping packets to the other using the icmpv4 model (i.e., the Ping simulator that
comes with ns-3). For the third, we used our model generator to create a Ping model
and a scenario automatically (using the live capture as input). In many respects,
the capture from our model generator was the most similar to the live capture.

2.  Methods  and  Procedures 


We  started  by  investigating  the  current  state  of  the  art  in  reverse  engineering  and 
binary  analysis  of  network  protocols.  We  identified  tools,  algorithms,  and 
approaches  that  we  could  leverage  for  our  work.  Afterward,  we  designed  our 
system  with  a  modular  architecture  to  facilitate  debugging  and  to  support  future 
component  integration.  Lastly,  we  built  our  system  and  tested  its  effectiveness  by 
comparing its outputs against 2 controls. We describe this in the following
subsections. 

2.1  Gap  Analysis 

The  actual  behavior  of  protocols  can  vary  depending  on  their  implementation. 
These  variances  within  the  same  type  of  protocols  can  yield  unexpected  results 
when performing field tests. The success of these tests depends greatly on the
accuracy of the protocol models that make up the scenario. The data that results
from nodes communicating a protocol, including the field vocabulary (i.e., the
possible values that packet fields can hold), can affect the outcome of these tests.
Determining the information required for extraction from the network packets was
crucial for the generation of precise models.

2.1.1  A  Survey  of  Commonly  Used  Protocol  Reverse  Engineering  Tools 

Security analysts are often presented with network traffic captures that contain
undocumented protocols or binary files. Determining the content and purpose of
these protocols is usually a manual process in which an analyst conducts reverse
engineering: extracting the structure, attributes, and data from a network protocol
implementation without access to its specification.3 Several tools have been
implemented in attempts to solve this problem. Some of these tools are described in
the following subsections.

2.1.1.1 Automatic Semantics-Aware Analysis of Network Payloads (ASAP) and
Protocol Inspection and State Machine Analysis (PRISMA)

ASAP4 is an open-source tool, written for the “R”5 statistical computing
language, that is designed to automatically extract semantics-aware components
from recorded traffic. The author, Tammo Krueger, used this tool to characterize
normal network protocol behavior by analyzing information in specified fields and
then to identify and sanitize anomalous, malicious Hypertext Transfer Protocol
(HTTP) requests. ASAP leverages 2 other tools as part of its analysis process.
PRISMA4 contributes to ASAP by learning stateful models from the network traffic
of a service; these models can be used to simulate valid communication. Sally6 is
a small tool for mapping a set of strings to a set of vectors. Together, these
tools contribute to ASAP’s functionality:

1)  An  alphabet  of  relevant  strings  is  extracted  from  raw  data  and  used  to  map 
payloads  into  a  vector  space  for  analysis. 

2)  Matrix  factorization  is  applied  to  identify  base  directions  in  the  vector 
space,  characterizing  usage  patterns  of  mapped  payloads. 
3)  Base directions are traced back to a conjunction of strings from the alphabet
resulting in a template of typical communication content.

ASAP  initially  uses  Sally  to  read  a  raw  file,  “asap.raw”,  which  contains  raw 
network  packet  payloads.  When  Sally  maps  out  the  strings  to  a  vector  space,  they 
are  characterized  by  specific  features  that  include  bytes,  tokens,  n-grams  of  bytes, 
or  n-grams  of  tokens.  It  also  generates  a  sparse  vector  of  count  values.  ASAP  then 
takes  in  this  output  and  extracts  an  alphabet  constructed  from  a  set  of  basic  strings 
and  relevant  strings.  This  alphabet  is  then  refined  using  filtering  and  correlation 
techniques. 
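The mapping step can be illustrated with a minimal sketch. This is not Sally's implementation; it only shows, under simplifying assumptions, how a payload might be turned into a sparse vector of byte n-gram counts. The function name `ngram_counts` and the sample payload are hypothetical.

```python
from collections import Counter


def ngram_counts(payload: bytes, n: int = 3) -> Counter:
    """Return a sparse count vector: each byte n-gram maps to its frequency."""
    return Counter(payload[i:i + n] for i in range(len(payload) - n + 1))


# Hypothetical payload, loosely modeled on an HTTP request line.
vector = ngram_counts(b"GET /index.html HTTP/1.1", n=3)
print(vector[b"GET"])   # occurrences of the 3-gram "GET"
print(vector[b"xyz"])   # n-grams that never occur simply count as 0
```

Only the n-grams that actually occur are stored, which is what makes the vector sparse.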

Alphabet  Extraction 

Part of the alphabet extraction process involves applying correlation techniques to
identify nonconstant and nonvolatile strings. Strings within network payloads
naturally appear with different frequencies, ranging from volatile to constant
occurrences. For instance, HTTP requests contain the string “HTTP” in the header,
whereas other parts, such as timestamps or session numbers, are highly variable.
Statistical t-tests are applied to determine a p-value, which indicates the
likelihood of encountering certain values. A tutorial for using this application,
provided by Tammo Krueger, included examples of HTTP GET requests, Simple Mail
Transfer Protocol (SMTP), and File Transfer Protocol (FTP), for which the alphabet
is based on tokens. For binary network protocols such as
Domain  Name  System  (DNS),  Server  Message  Block  (SMB),  and  Network  File 
System  (NFS),  the  tool  utilizes  n-grams  for  the  alphabet.  Additional  correlation 
techniques  are  applied  to  combine  co-occurring  strings. 

Matrix  Factorization 

Mapping  the  network  payloads  to  a  vector  space  reflects  characteristics  captured 
by  the  alphabet.  Payloads  that  share  several  substrings  will  appear  closer  to  each 
other,  while  those  with  different  content  are  farther  apart.  This  process  allows  the 
discovery  of  semantics-aware  components  and  base  directions  in  the  vector  space. 
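The notion of closeness in the vector space can be sketched as follows. The example does not perform the matrix factorization itself; it only illustrates, assuming whitespace-separated tokens and cosine similarity as the measure, why payloads with shared substrings end up near each other. The helper names and sample payloads are hypothetical.

```python
import math
from collections import Counter


def token_vector(payload):
    """Map a payload to a sparse vector of whitespace-separated token counts."""
    return Counter(payload.split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (1.0 = same direction)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


get_1 = token_vector("GET static/a.html HTTP/1.1 Host: www.foobar.com")
get_2 = token_vector("GET static/b.html HTTP/1.1 Host: www.foobar.com")
other = token_vector("MAIL FROM: user@example.com")

# The two GET requests share four of five tokens; the mail command shares none.
print(cosine_similarity(get_1, get_2) > cosine_similarity(get_1, other))
```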

Generating  Communication  Templates 

Strings of the alphabet that exceed a specific threshold are then selected inside the
base directions to construct a template. Alphabets of n-grams are concatenated to
regain parts of the original ordering. For example, if a basis contains the
overlapping 3-grams “Hos” and “ost” (together with a third 3-gram beginning with
“st”), it can be inferred that these tokens overlap and can be concatenated to make
up the string “Host”. Concatenation is repeated until no more overlaps exist. The
resulting token is then added to the representation list, and the procedure is
repeated for the remaining token with the highest value.
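The concatenation of overlapping n-grams can be sketched in a few lines. This is an illustrative simplification, not ASAP's code: it assumes the n-grams are already given in the order in which they should be chained, and the function name `merge_overlapping` is hypothetical.

```python
def merge_overlapping(ngrams):
    """Chain n-grams whose (n-1)-character prefix matches the end of the result."""
    if not ngrams:
        return ""
    merged = ngrams[0]
    for gram in ngrams[1:]:
        overlap = len(gram) - 1
        if merged.endswith(gram[:overlap]):
            merged += gram[overlap:]   # append only the new character
        else:
            break                      # no overlap: stop extending this token
    return merged


print(merge_overlapping(["Hos", "ost"]))  # Host
```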
2.1.1.2 Prospex

Prospex7 is a system that can automatically infer specifications for stateful network
protocols, including state machine information. This tool can extract the format
specifications from a message by monitoring the application as it processes its
inputs. The Prospex system’s main contributions follow:

1)  The system can determine when 2 messages in a session are similar, based
not only on their formats but also on the effects that they have on the
receiving server’s execution.

2)  A  state  machine  can  be  inferred,  which  specifies  the  order  in  which 
messages  can  be  exchanged — given  no  prior  knowledge  of  the  protocol 
under  analysis. 

This  system  consists  of  several  modules.  The  state  machine  inference  module, 
which  is  available  as  open  source,  is  used  for  our  model  generator.  This  module 
accepts  a  text  file,  “sessions.txt”,  which  is  composed  of  the  sequences  of  messages 
that  are  observed  within  a  PCAP  file.  A  state  machine  is  then  produced  from  these 
sequences representing the behavior of the network protocol. Another algorithm,
Exbar,8 is then applied to reduce (compress) the state machine by merging similar
states. The resulting minimal state machine is the final representation of the
protocol's behavior.

2.1.1.3 Netzob

Netzob9 is an open-source reverse engineering toolset sponsored by AMOSSYS and
Supélec. Netzob has a rich feature set that includes inter- and
intrapacket-dependency inference, packet simulation, and exports for Wireshark and
Peach10 pit files. It enables protocol message format and state machine inference
through passive (i.e., no human intervention) and active processes. Afterward, the
model can be used to simulate realistic and controllable traffic. This toolset has
been successful at reversing unknown protocols, such as the ZeroAccess botnet
command and control (C2) communication.11 Figure 1 shows the main modules of Netzob.
Fig. 1 Netzob main modules

•  Import module: Data can be imported using the built-in packet sniffer
or by specifying an existing capture, network flow, or other accepted
format.

•  Protocol inference modules: The vocabulary and grammar inference
methods constitute the core of Netzob. This tool has automated and manual
mechanisms built in, which allow the reverse engineering of
communication flows.

•  Simulation module: Netzob uses the vocabulary and grammar models
previously inferred to understand and generate communication traffic
between multiple actors.

•  Export module: Netzob can export an inferred model of a protocol in
multiple formats, making it extendable to third-party software.

2.1.2  Applicability  and  Limitations  of  Existing  Tools  for  the  Model 
Generator 

We wanted to leverage existing tools, but we first had to determine their
suitability for our problem. Most tools were closed source, unavailable, or not
mature enough for our needs. However, we were able to reuse some capabilities of
available tools (e.g., Exbar) and develop our software in a way that supports
parallel development (i.e., our tool can plug-and-play with other tools: Prospex,
Netzob, and ASAP). In the following subsections, we describe our analysis of these
tools and some limitations we encountered when considering them for our work.


Approved  for  public  release;  distribution  is  unlimited. 



2.1.2.1 ASAP


The  author  of  ASAP,  Tammo  Krueger,  utilized  ASAP/PRISMA  to  extract  network 
protocols  by  analyzing  information  in  specified  fields  and  then  identified  malicious 
HTTP  requests  and  sanitized  them.  He  provides  a  step-by-step  tutorial  that  shows 
the  process  of  passing  a  raw  data  file  into  ASAP  and  then  viewing  the  results.  This 
tutorial  includes  the  sample  input  files  containing  raw  network  payloads  with  HTTP 
GET  requests  (see  Fig.  2). 


GET cgi/admin.php?action=rename&par=NeBkxkAOrw HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Opera/9.20 (Windo
GET static/iXAwxvxe.html HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0;
GET cgi/search.php?s=giwXxb268McP3GwLakG HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Opera/9.20 (Windows NT
GET cgi/admin.php?action=move&par=Q4BasOAhAuIiZe18 HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Opera/9.20 (W
GET cgi/admin.php?action=show&par=RpkFdlol00AHaD2PfS HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Opera/9.20
GET cgi/admin.php?action=move&par=y9ppTXpK4juTE4s HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Mozilla/5.0 (W
GET cgi/search.php?s=GF6FU7MfcylN5P9C HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Opera/9.20 (Windows NT 5.1
GET cgi/search.php?s=VsvcxcIM HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Opera/9.20 (Windows NT 5.1; U; de)
GET cgi/admin.php?action=rename&par=ViTBkOXiYTlJ73o HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Mozilla/5.0
GET cgi/admin.php?action=delete&par=qgOg4lfa HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Mozilla/5.0 (Window
GET static/ZOsLZtls.html HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.0;
GET cgi/admin.php?action=rename&par=2p8xkOxfVBgZr HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Mozilla/5.0 (W
GET cgi/admin.php?action=rename&par=AyEKHYqiWQlFVd HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Mozilla/5.0 (
GET cgi/admin.php?action=rename&par=2htTKuPZXi HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Mozilla/5.0 (Wind
GET static/llbYOYcG.html HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Opera/9.20 (Windows NT 5.1; U; de)
GET static/7VCT9CBd.html HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Opera/9.20 (Windows NT 6.0; U; de)
GET cgi/admin.php?action=delete&par=K6lWDplKRXZJjE0 HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Opera/9.20
GET cgi/admin.php?action=delete&par=Th1ElQxWT HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Mozilla/5.0 (Windo
GET cgi/search.php?s=rX8YYvKGZ HTTP/1.1 Host: www.foobar.com Accept: */* User-Agent: Mozilla/5.0 (Windows; U; Windows


Fig.  2  Raw  network  packet 

For the purposes of the model generator, we attempted to run this tool on payloads
containing non-HTTP data. We did this by executing every step in the tutorial but
using a different file as input. When the file was read into Sally, a new file
named “asap2.sally” was generated and then used as input for ASAP (which used
R). “R” successfully read the new file using the “loadPrismaData(asap2)”
command but failed when attempting to create the dataset using “data(asap2)”.
This is the function that allows one to see the matrix and other pertinent data.
After further analysis and testing, we determined that the “data(asap)”
instruction works on the original file because the generated dataset is already
prepackaged inside of PRISMA. To arrive at this conclusion, we removed all of the
asap files from the folder that the “R” program accesses and then ran the
“data(asap)” command. The command still succeeded, meaning that the file is not
read from the local file system but is instead accessed from the
installed packages.

Another  limitation  with  ASAP  is  that  it  utilizes  a  delimiter  that  needs  to  be 
manually  specified  for  parsing  of  the  raw  data  file.  This  means  that  prior  knowledge 
of  the  protocol  is  required  for  this  program  to  succeed — rendering  the  process  only 
semiautomatic. 
2.1.2.2 Netzob


Netzob has been used to successfully reverse engineer unknown protocols, such as
the ZeroAccess botnet C2 malware communication.11 Because Netzob is open source, we
hoped to take advantage of its vocabulary and grammar inference modules.
However, it is currently not well suited for documented protocols; there is no way
to import structures, field sizes, or dependencies from specifications, such as
request for comments (RFC) documents, or from network sniffer data, such as
Wireshark PDML data. Grammar inference is not supported for protocols at Open
Systems Interconnection (OSI) Layer 3 and below, such as Optimized Link State
Routing (OLSR), Open Shortest Path First (OSPF), and Ping. Lastly, there has been
little maintenance since 2013. The detailed evaluation steps of Netzob are in the
Appendix.

2.1.2.3 Prospex

Prospex7 was very promising because of its success at generating protocol state
machines. While researching several tools, we found that many systems focus only
on extracting the format of individual protocol messages. This makes generating
the protocol state machine and producing the specifications for stateful network
protocols much more difficult. The initial limitation we found with Prospex is
that only the state machine inferencing tool was available for download, along
with the Exbar algorithm needed to minimize the state machine. The last phase of
Prospex is shown in Fig. 3.


Fig.  3  Prospex  system  overview7 

The  state  machine  inferencing  tool  reads  a  text  file  composed  of  sequences  of 
message  types  observed  from  a  PCAP  file.  Figure  4  shows  a  sample  input  file. 
sessions.txt

8 0808080808080


Fig.  4  Prospex  state  machine  inferencing  tool  input  file 

Figure 4 shows one sequence of ICMP message types. Type 8 is a request, and type
0 is a reply. The input file describes the order in which the different message
types are found in the PCAP file. Using this input, the tool generates a state tree
file in dot format. The dot utility, part of the Graphviz12 drawing package, is
used to generate a visualization of the state tree, as shown in Fig. 5a. Exbar
compresses this tree and generates the final state machine file. Figure 5b shows
the final state machine for the Ping protocol.
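To make the state tree concrete, the sketch below builds a prefix tree from sequences of message types and renders it as Graphviz dot text. It is not the Prospex implementation and does not perform Exbar's state merging; the state numbering, the function names, and the dot layout are assumptions made for illustration.

```python
def build_state_tree(sessions):
    """Build a prefix tree: one state per distinct prefix of message types."""
    transitions = {}          # (state, message_type) -> next state
    next_state = 1            # state 0 is the root
    for session in sessions:
        state = 0
        for msg_type in session:
            key = (state, msg_type)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
    return transitions


def to_dot(transitions):
    """Render the transitions as Graphviz dot text."""
    lines = ["digraph state_tree {"]
    for (src, msg_type), dst in sorted(transitions.items()):
        lines.append(f'    {src} -> {dst} [label="{msg_type}"];')
    lines.append("}")
    return "\n".join(lines)


# One session of alternating echo requests (8) and replies (0).
tree = build_state_tree([[8, 0, 8, 0]])
print(to_dot(tree))
```

A tree built this way grows with the length of each session, which is why a merging step such as Exbar is needed to obtain a compact state machine.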


Fig.  5  State  tree  file  output  (left);  final  Ping  protocol  state  machine  (right) 
Based on our analysis, we determined that the State Machine Inferencing feature of
the Prospex tool would be beneficial to our work. To incorporate Exbar, we
had to generate compatible data (i.e., the input file, “sessions.txt”), which must
contain only integer values. This process is described in Section 2.2.3.

2.1.3  Model  Generation 

To generate an accurate protocol model, specific information is needed from
the packet captures (i.e., PCAP files). This data includes the following:

•  Protocol fields
   o  Size
   o  Position
   o  Value (hex)

•  Field vocabulary
   o  Entropy

•  Grammar (state machine)

2.1.3.1 Protocol Fields

Figure  6,  which  appears  in  Data  Communications  and  Networking,13  shows  the 
fields  that  exist  in  the  ICMP  Ping  protocol.  Protocols  are  described  by  their  fields, 
including  sizes,  possible  values,  and  states. 


Fig.  6  ICMP  header  format 
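As a small illustration of fields, sizes, and positions, the sketch below unpacks the first 8 bytes of an ICMP echo message. The byte values are taken from the sample echo request that appears in Figs. 9 and 10; splitting the rest of the header into an identifier and a sequence number applies to echo messages, and the variable names are chosen for this example only.

```python
import struct

# First 8 bytes of the sample echo request: type 8, code 0,
# checksum 0x6fe9, identifier 0x7725, sequence number 1.
header = bytes.fromhex("08006fe977250001")

# "!" selects network byte order; B is a 1-byte field, H is a 2-byte field.
icmp_type, code, checksum, identifier, sequence = struct.unpack("!BBHHH", header)

print(icmp_type, code, hex(checksum), identifier, sequence)
```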

After analyzing different tools, we found that Tshark14 was the best choice for
extracting the fields required for our model generation. This tool is a
command-line version of Wireshark. It has a large number of existing protocol
dissectors, which allow it to parse the structure of protocols, including field
names, data types, and sizes.

2.1.3.2 Field Vocabulary

Figure 6 shows that the ICMP header is composed of 4 fields. Type, for instance,
contains the type of message that is being sent. In this case, the message is
either Type 8, “request”, or Type 0, “response”. Extracting this information is
crucial for automated model generation. The vocabulary of a protocol field
(i.e., the values that can exist in that field) can be identified from a PCAP file.
To measure the amount of variance in a field’s vocabulary, we calculate its entropy.
If the entropy is closer to “1”, the field is more random.
If it is closer to “0”, a smaller range of values exists for
that field within the PCAP. An entropy of “0” indicates that the vocabulary is static,
which means that it consists of a single, unchanging value.
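The entropy calculation can be sketched as follows. The exact formula used by the model generator is not given here, so this example assumes Shannon entropy over the observed values, normalized by the base-2 logarithm of the number of samples so that the result falls between 0 and 1. The function name and sample values are hypothetical.

```python
import math
from collections import Counter


def normalized_entropy(values):
    """Shannon entropy of the observed values, scaled to the range [0, 1]."""
    total = len(values)
    if total <= 1:
        return 0.0
    counts = Counter(values)
    entropy = sum((c / total) * math.log2(total / c) for c in counts.values())
    return entropy / math.log2(total)   # assumed normalization: log2(sample count)


print(normalized_entropy(["00", "00", "00", "00"]))          # static field
print(normalized_entropy(["0001", "0002", "0003", "0004"]))  # every value differs
```

Under this assumption, a field such as the ICMP code yields 0, a field that changes in every packet (such as a sequence number) yields 1, and a type field that alternates between two values falls in between.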

To facilitate the process of extracting the vocabulary, we used Tshark to convert
the PCAP files (which are in binary format) to Packet Details Markup Language
(PDML15), an XML-based markup language.
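A conversion of this kind might be scripted as shown below. The sketch relies on the Tshark options `-r` (read a capture file), `-Y` (display filter), and `-T pdml` (PDML output); older Tshark releases used `-R` for the filter. The function names and file names are hypothetical, and this is not the code of the model generator.

```python
import subprocess


def pdml_command(pcap_path, protocol):
    """Build the Tshark command line that prints a filtered capture as PDML."""
    return ["tshark", "-r", pcap_path, "-Y", protocol, "-T", "pdml"]


def pcap_to_pdml(pcap_path, protocol, pdml_path):
    """Run Tshark and save its PDML output (requires Tshark on the PATH)."""
    with open(pdml_path, "w", encoding="utf-8") as out:
        subprocess.run(pdml_command(pcap_path, protocol), stdout=out, check=True)


# Show the command that would be run for a hypothetical capture of Ping traffic.
print(" ".join(pdml_command("ping.pcap", "icmp")))
```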

2.1.3.3 Grammar (State Machine)

Once  the  fields  and  vocabulary  are  extracted  from  the  PCAP  files,  knowing  how 
the  protocols  need  to  behave  is  the  next  step.  To  closely  mimic  protocol  behavior, 
we  needed  to  infer  the  state  machine  specification.  We  leveraged  the  State  Machine 
Inferencing  tool  from  Prospex.  To  do  this,  we  filtered  out  the  protocol  of  interest 
from  the  PCAP  file  and  then  extracted  the  message  types  from  the  fields  to  produce 
an  input  file  for  Prospex  to  generate  the  state  machine.  This  state  machine  is  then 
used  to  produce  the  ns-3  protocol  models. 
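The extraction of message types can be sketched as below. The example assumes that each packet has already been reduced to a dictionary of field names and hex values and that the message type lives in a field named `icmp.type`; the exact layout that Prospex expects in its input file is not reproduced here, so the space-separated output is only illustrative.

```python
def message_type_sequence(packets, type_field="icmp.type"):
    """Collect the integer message type of every packet, in capture order."""
    sequence = []
    for fields in packets:                 # one dict of field values per packet
        if type_field in fields:
            sequence.append(int(fields[type_field], 16))   # hex string -> int
    return sequence


# Hypothetical capture: two echo requests (08), each followed by a reply (00).
packets = [{"icmp.type": "08"}, {"icmp.type": "00"},
           {"icmp.type": "08"}, {"icmp.type": "00"}]
print(" ".join(str(t) for t in message_type_sequence(packets)))  # 8 0 8 0
```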

2.2  Path  Forward 

The  ns-3  model  generator  reads  a  PCAP  file  and — by  leveraging  Tshark  and 
Prospex — generates  the  protocol  model  C++  files  needed  to  run  the  simulation  in 
ns-3.  Figure  7  shows  the  system  overview. 


Fig.  7  ns-3  automated  model  generator  system  overview 
This tool is composed of 7 different modules. It takes a PCAP file as input and
produces 3 C++ files that make up the ns-3 scenario and protocol model. The modules
that make up this tool are listed below.

•  pdmlExtractor.py 

•  xmlToNs-3Scenario.py 

•  packetTypeExtractor.py 

•  fieldVocab.py 

•  fieldVocabToNs-3Model.py 

•  Prospex  (third  party) 

•  dotToCppGrammer.py 

The “pdmlExtractor.py” program generates 2 files: “fields.txt” and “flows.xml”.
The “flows.xml” file is used to generate the standardized scenario XML file that is
eventually passed to the “xmlToNs-3Scenario.py” program, which produces the
“summerPing.cc” ns-3 simulation file. The “fields.txt” file is fed into the
“packetTypeExtractor.py” program to produce the “packetTypeSequences.txt” file
that Prospex eventually uses to produce the state machine. Both the “fields.txt”
and “packetTypeSequences.txt” files are needed by the “fieldVocab.py”
program to produce the standardized model XML file that is input into
“fieldVocabToNs-3Model.py” for the model generation. Prospex then reads
the “packetTypeSequences.txt” file and produces the state machine in dot format.
The “dotToCppGrammer.py” program uses this file to generate a C++ class
representation of the state machine. All of the generated C++ files are used by ns-3
to run the protocol simulation.
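The data flow described above implies an order in which the modules can run. The sketch below encodes the dependencies as read from this description and derives one valid order with Python's standard `graphlib` module (Python 3.9 or later). It is only a way to visualize the pipeline, not part of the tool, and the dictionary name is hypothetical.

```python
from graphlib import TopologicalSorter

# Each module maps to the modules whose output it consumes.
depends_on = {
    "pdmlExtractor.py": set(),
    "xmlToNs-3Scenario.py": {"pdmlExtractor.py"},
    "packetTypeExtractor.py": {"pdmlExtractor.py"},
    "fieldVocab.py": {"pdmlExtractor.py", "packetTypeExtractor.py"},
    "fieldVocabToNs-3Model.py": {"fieldVocab.py"},
    "Prospex": {"packetTypeExtractor.py"},
    "dotToCppGrammer.py": {"Prospex"},
}

# static_order() yields every module after all of its dependencies.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```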

2.2.1  "pdmlExtractor.py" 

The  “pdmlExtractor.py”  Python  script  reads  a  PCAP  file  as  input.  The  first  thing 
this  program  does  is  filter  all  of  the  protocols  in  the  PCAP  and  keep  only  the 
protocol  specified  when  the  program  is  run.  The  program  then  converts  the  PCAP 
file  from  binary  to  PDML  using  Tshark.  Figure  8  shows  how  the  PDML  file  is 
formatted. 
python pdmlExtractor.py <name of PCAP> <Protocol Name>


[Screenshot of the PDML output; the text is not legible in this copy. Each <proto> element (e.g., the frame and Ethernet II layers) contains <field> elements that carry attributes such as name, showname, show, size, pos, and value.]


Fig.  8  PCAP  to  PDML  sample 


Once  the  file  is  converted,  the  program  will  traverse  each  packet  within  the  PCAP 
file  and  find  the  protocol  that  is  specified  when  the  program  is  called.  Once  it  finds 
the  position  in  the  packet,  it  remembers  that  position  for  each  packet.  Figure  9 
shows  an  example  of  the  line  in  the  PDML  file  that  shows  the  name  of  the  protocol. 
In  this  example  we  use  the  ICMP  protocol. 

<field name="ip.host" showname="Source or Destination Host: 8.8.8.8" hide="yes" size="4" pos="30" show=
<field name="" show="Source GeoIP: Unknown" size="4" pos="26" value="0a010a0d"/>
<field name="" show="Destination GeoIP: Unknown" size="4" pos="30" value="08080808"/>
</proto>
<proto name="icmp" showname="Internet Control Message Protocol" size="64" pos="34">
<field name="icmp.type" showname="Type: 8 (Echo (ping) request)" size="1" pos="34" show="8" value="08"/>
<field name="icmp.code" showname="Code: 0" size="1" pos="35" show="0" value="00"/>
<field name="icmp.checksum" showname="Checksum: 0x6fe9 [correct]" size="2" pos="36" show="0x6fe9" value
<field name="icmp.checksum_bad" showname="Bad Checksum: False" hide="yes" size="2" pos="36" show="0" va
<field name="icmp.ident" showname="Identifier (BE): 30501 (0x7725)" size="2" pos="38" show="30501" valu
<field name="icmp.ident" showname="Identifier (LE): 9591 (0x2577)" size="2" pos="38" show="9591" value=
<field name="icmp.seq" showname="Sequence number (BE): 1 (0x0001)" size="2" pos="40" show="1" value="00
<field name="icmp.seq_le" showname="Sequence number (LE): 256 (0x0100)" size="2" pos="40" show="256" va
<field name="icmp.data_time" showname="Timestamp from icmp data: Jun 22, 2015 11:58:17.000000000 MDT" s
<field name="icmp.data_time_relative" showname="Timestamp from icmp data (relative): 0.228152000 second
<field name="data" value="0d7b030000000000101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2
<field name="data.data" showname="Data: 0d7b030000000000101112131415161718191a1b1c1d1e1f..." size="48
<field name="data.len" showname="Length: 48" size="0" pos="50" show="48"/>
</field>
</proto>


Fig.  9  ICMP  protocol  found  example 
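Locating a protocol and reading its field attributes can be sketched with Python's standard XML parser. The PDML fragment below is a shortened, hand-written sample modeled on the ICMP example in Fig. 9, and `extract_fields` is a hypothetical helper, not the code of “pdmlExtractor.py”. Missing attributes are filled with “unspecified”, mirroring the convention visible in the “fields.txt” example.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-written PDML fragment modeled on the ICMP example.
PDML = """<pdml><packet>
  <proto name="icmp" showname="Internet Control Message Protocol" pos="34">
    <field name="icmp.type" showname="Type: 8 (Echo (ping) request)" size="1" pos="34" show="8" value="08"/>
    <field name="icmp.code" showname="Code: 0" size="1" pos="35" show="0" value="00"/>
  </proto>
</packet></pdml>"""

ATTRIBUTES = ("name", "showname", "show", "size", "pos", "value")


def extract_fields(pdml_text, protocol):
    """Return one dict of attributes per <field> of the requested protocol."""
    root = ET.fromstring(pdml_text)
    records = []
    for proto in root.iter("proto"):
        if proto.get("name") != protocol:
            continue
        for field in proto.iter("field"):
            records.append({a: field.get(a, "unspecified") for a in ATTRIBUTES})
    return records


for record in extract_fields(PDML, "icmp"):
    print(record["name"], record["size"], record["pos"], record["value"])
```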

Most of the lines contain the “name”, “showname”, “show”, “size”, “pos”, and
“value” attributes. The “pdmlExtractor.py” program extracts all of these attributes
and writes them to a file, “fields.txt”, with “#” as the delimiter. These values
form part of the vocabulary and are used to calculate the entropy for each field of
the protocol. The program then repeats the same process for each packet within the
PCAP file. Figure 10 shows an example “fields.txt” file.
Type: 8 (Echo (ping) request): #size: 1 #show: 8 #pos: 34 #value: 08 #name: icmp.type
Code: 0: #size: 1 #show: 0 #pos: 35 #value: 00 #name: icmp.code
Checksum: 0x6fe9 [correct]: #size: 2 #show: 0x6fe9 #pos: 36 #value: 6fe9 #name: icmp.checksum
Bad Checksum: False: #size: 2 #show: 0 #pos: 36 #value: 6fe9 #name: icmp.checksum_bad
Identifier (BE): 30501 (0x7725): #size: 2 #show: 30501 #pos: 38 #value: 7725 #name: icmp.ident
Identifier (LE): 9591 (0x2577): #size: 2 #show: 9591 #pos: 38 #value: 7725 #name: icmp.ident
Sequence number (BE): 1 (0x0001): #size: 2 #show: 1 #pos: 40 #value: 0001 #name: icmp.seq
Sequence number (LE): 256 (0x0100): #size: 2 #show: 256 #pos: 40 #value: 0001 #name: icmp.seq_le
Timestamp from icmp data: Jun 22, 2015 11:58:17.000000000 MDT: #size: 8 #show: Jun 22, 2015 11:58:17.000000000 #pos: 42 #value: b94c885500000000 #name: icmp.data_time
Timestamp from icmp data (relative): 0.228152000 seconds: #size: 8 #show: 0.228152000 #pos: 42 #value: b94c885500000000 #name: icmp.data_time_relative
unspecified: #size: unspecified #show: unspecified #pos: unspecified #value: 0d7b030000000000101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637 #name: data
Data: 0d7b030000000000101112131415161718191a1b1c1d1e1f...: #size: 48 #show: 0d:7b:03:00:00:00:00:00:10:11:12:13:14:15:16:17:18:19:1a:1b:1c:1d:1e:1f:20:21:22:23:24:25:26:27:28:29:2a:2b:2c:2d:2e:2f:30:31:32:33:34:35:36:37 #pos: 50 #value: 0d7b030000000000101112131415161718191a1b1c1d1e1f202122232425262728292a2b2c2d2e2f3031323334353637 #name: data.data
Length: 48: #size: 0 #show: 48 #pos: 50 #value: unspecified #name: data.len



Fig. 10 “fields.txt” file example
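The extraction step described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual “pdmlExtractor.py” source: the function name `extract_fields`, the `SAMPLE` document, and the choice to write “unspecified” for missing attributes (mirroring Fig. 10) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Attributes written for every field, in the order seen in Fig. 10.
ATTRS = ("size", "show", "pos", "value", "name")


def extract_fields(pdml_text, protocol):
    """Return one '#'-delimited record per field of `protocol` in each packet."""
    records = []
    root = ET.fromstring(pdml_text)
    for packet in root.iter("packet"):
        for proto in packet.iter("proto"):
            if proto.get("name") != protocol:
                continue
            for field in proto.iter("field"):
                # The showname leads the record; absent attributes are
                # written as "unspecified" (assumed convention).
                parts = [field.get("showname", "unspecified") + ":"]
                parts += ["#%s: %s" % (a, field.get(a, "unspecified"))
                          for a in ATTRS]
                records.append(" ".join(parts))
    return records


# Hypothetical two-field PDML fragment, trimmed from the Ping example.
SAMPLE = """<pdml><packet><proto name="icmp">
<field name="icmp.type" showname="Type: 8 (Echo (ping) request)"
       size="1" pos="34" show="8" value="08"/>
<field name="data.len" showname="Length: 48" size="0" pos="50" show="48"/>
</proto></packet></pdml>"""

records = extract_fields(SAMPLE, "icmp")
for record in records:
    print(record)
```

Writing `records` to disk, one record per line, would give a file in the layout of Fig. 10.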


The “pdmlExtractor.py” program also generates a “flows.xml” file. This file uses other fields within each packet to record the following information:

1) Source Internet Protocol (IP) address

2) Destination IP address

3) Source MAC address

4) Destination MAC address

5) Epoch time

This information is extracted from each packet within the capture and written to the flows file. Figure 11 shows an example of the “flows.xml” file for a Ping protocol.


<root>
  <flow>
    <protocol>icmp</protocol>
    <macsrc>d0:67:e5:41:b9:d7</macsrc>
    <macdst>c4:04:15:30:b4:43</macdst>
    <ipsrc>10.1.10.13</ipsrc>
    <ipdst>8.8.8.8</ipdst>
    <time>2015-06-22 11:58:17:228152</time>
  </flow>
  <flow>
    <protocol>icmp</protocol>
    <macsrc>c4:04:15:30:b4:43</macsrc>
    <macdst>d0:67:e5:41:b9:d7</macdst>
    <ipsrc>8.8.8.8</ipsrc>
    <ipdst>10.1.10.13</ipdst>
    <time>2015-06-22 11:58:17:250320</time>
  </flow>
  <flow>
    <protocol>icmp</protocol>
    <macsrc>d0:67:e5:41:b9:d7</macsrc>
    <macdst>c4:04:15:30:b4:43</macdst>
    <ipsrc>10.1.10.13</ipsrc>
    <ipdst>8.8.8.8</ipdst>
    <time>2015-06-22 11:58:18:229549</time>
  </flow>


Fig. 11 “flows.xml” file example
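A flow record of this shape can be built from the PDML with the standard library alone. The sketch below is illustrative only: it assumes the addresses come from the Wireshark fields `eth.src`, `eth.dst`, `ip.src`, and `ip.dst` and the time from `frame.time_epoch`, and it formats the time in UTC, whereas the figure shows local time. The function name `build_flows` and the sample packet are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Wireshark field name -> tag name used in the flows file (assumed mapping).
FLOW_FIELDS = {"eth.src": "macsrc", "eth.dst": "macdst",
               "ip.src": "ipsrc", "ip.dst": "ipdst"}


def build_flows(pdml_text, protocol):
    """Build a <root> element holding one <flow> per packet in the PDML."""
    root = ET.Element("root")
    for packet in ET.fromstring(pdml_text).iter("packet"):
        # Map every field name in the packet to its displayed value.
        shown = {f.get("name"): f.get("show") for f in packet.iter("field")}
        flow = ET.SubElement(root, "flow")
        ET.SubElement(flow, "protocol").text = protocol
        for pdml_name, tag in FLOW_FIELDS.items():
            ET.SubElement(flow, tag).text = shown.get(pdml_name, "unspecified")
        epoch = float(shown.get("frame.time_epoch", 0))
        stamp = datetime.fromtimestamp(epoch, tz=timezone.utc)
        ET.SubElement(flow, "time").text = stamp.strftime("%Y-%m-%d %H:%M:%S:%f")
    return root


# Hypothetical single-packet PDML fragment.
SAMPLE_PDML = """<pdml><packet>
<proto name="frame"><field name="frame.time_epoch" show="1434995897.500000"/></proto>
<proto name="eth">
  <field name="eth.dst" show="c4:04:15:30:b4:43"/>
  <field name="eth.src" show="d0:67:e5:41:b9:d7"/>
</proto>
<proto name="ip">
  <field name="ip.src" show="10.1.10.13"/>
  <field name="ip.dst" show="8.8.8.8"/>
</proto>
</packet></pdml>"""

flows = build_flows(SAMPLE_PDML, "icmp")
print(ET.tostring(flows, encoding="unicode"))
```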




The “flows.xml” file is also used to generate a standardized scenario XML file. After researching other network simulation programs, we chose a structure that would be easy to integrate with other simulation engines. This structure describes all of the protocols that originate from each node or computer, assuming that a node can be identified by a single source MAC address. This part of the program ties each protocol to a specific node and lists the addresses, timestamps, and packet count for each flow. Figure 12 shows a “scenarioStdConfigurationFile.xml” file for a Ping protocol that was captured with Wireshark.


<network>
  <node>
    <number>0</number>
    <macsrc>d0:67:e5:41:b9:d7</macsrc>
    <flow>
      <protocol>icmp</protocol>
      <macdst>c4:04:15:30:b4:43</macdst>
      <ipsrc>10.1.10.13</ipsrc>
      <ipdst>8.8.8.8</ipdst>
      <time>
        <first>2015-06-22 11:58:17:228152</first>
        <last>2015-06-22 11:58:26:241474</last>
        <count>10</count>
      </time>
    </flow>
  </node>
  <node>
    <number>1</number>
    <macsrc>c4:04:15:30:b4:43</macsrc>
    <flow>
      <protocol>icmp</protocol>
      <macdst>d0:67:e5:41:b9:d7</macdst>
      <ipsrc>8.8.8.8</ipsrc>
      <ipdst>10.1.10.13</ipdst>
      <time>
        <first>2015-06-22 11:58:17:250320</first>
        <last>2015-06-22 11:58:25:283245</last>
        <count>9</count>
      </time>
    </flow>
  </node>
</network>


Fig.  12  ICMP  “flows.xml”  file 


This example describes 2 nodes that were communicating with each other using the Ping protocol. Each node is numbered and is distinguished by a unique source MAC address. Each node also lists the protocol that it generates, encapsulated within the flow tag, “<flow>”. A flow includes the protocol name, the destination MAC address, and the IP addresses. The times at which the flow was first and last seen, and the number of times it was seen in the PCAP file, are also recorded.
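The grouping behind this file can be sketched as follows. The example assumes that flows arrive in capture order and that a flow is identified by its protocol, destination MAC address, and IP addresses; the function name `summarize_flows` and the dictionary layout are hypothetical and stand in for whatever structure the actual program uses.

```python
def summarize_flows(flows):
    """Group per-packet flow records by source MAC address.

    `flows` is a list of dicts with the keys protocol, macsrc, macdst,
    ipsrc, ipdst, and time, in capture order. The result maps each source
    MAC (one node) to its flows, each with first/last time and a count.
    """
    nodes = {}
    for flow in flows:
        node = nodes.setdefault(flow["macsrc"], {})
        key = (flow["protocol"], flow["macdst"], flow["ipsrc"], flow["ipdst"])
        entry = node.setdefault(key, {"first": flow["time"], "count": 0})
        entry["last"] = flow["time"]   # overwritten until the final packet
        entry["count"] += 1
    return nodes


# The three packets shown in Fig. 11.
A, B = "d0:67:e5:41:b9:d7", "c4:04:15:30:b4:43"
FLOWS = [
    {"protocol": "icmp", "macsrc": A, "macdst": B, "ipsrc": "10.1.10.13",
     "ipdst": "8.8.8.8", "time": "2015-06-22 11:58:17:228152"},
    {"protocol": "icmp", "macsrc": B, "macdst": A, "ipsrc": "8.8.8.8",
     "ipdst": "10.1.10.13", "time": "2015-06-22 11:58:17:250320"},
    {"protocol": "icmp", "macsrc": A, "macdst": B, "ipsrc": "10.1.10.13",
     "ipdst": "8.8.8.8", "time": "2015-06-22 11:58:18:229549"},
]

nodes = summarize_flows(FLOWS)
for number, (macsrc, node_flows) in enumerate(nodes.items()):
    print(number, macsrc, list(node_flows.values()))
```

Numbering the nodes by order of first appearance, as `enumerate` does here, reproduces the node numbers 0 and 1 of Fig. 12.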


2.2.2  "xmlToNs-3Scenario.py" 
The “xmlToNs-3Scenario.py” Python script generates the actual scenario C++ file that is used by ns-3 to run the simulation. This program takes the “scenarioXMLConfigurationFile.xml” as input.

2.2.3  "packetTypeExtractor.py" 

One common feature of other protocol extraction tools is the use of a vocabulary dictionary. This feature was useful when extracting the message type for each protocol. Each protocol has a message type that is described by one of its fields. The challenge is that every protocol uses a different naming convention for this field. For example, the ICMP message type can be found in a field named “Type”, but other protocols might use the string “Command” as the field name. For this reason, each protocol would otherwise need to be treated differently when extracting its type. The “packetTypeExtractor.py” tool takes in the “fields.txt” file produced by the “pdmlExtractor.py” program and reads it line by line, comparing every string against a dictionary of common words. This dictionary is specified inside the tool and is easily expandable. If a type is found within the packet, the contents of that field are stored in a Python array named “allTypes”. If this is the first occurrence of the type, it is also stored in a separate array of unique message types called “uniqueTypes”. Once all of the types are found, the two arrays are used to generate the “packetTypeSequences.txt” file. This text file is used to generate the state machine using Prospex. Because Prospex accepts only integer data types in the sequences file, each entry of the “allTypes” array, taken in the order in which it was found, is written as the index of that type in the “uniqueTypes” array. Figure 13 illustrates this process.


Fig.  13  “packetTypeSequence.txt”  generation 
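The index mapping can be sketched as below. The record format follows Fig. 10; the two-word keyword dictionary, the helper names, and the use of the “show” attribute as the type value are assumptions for illustration, and the layout of the sequences file expected by Prospex is not reproduced here.

```python
# Illustrative keyword dictionary; the real tool's list is longer.
TYPE_KEYWORDS = ("Type", "Command")


def parse_record(record):
    """Split one '#'-delimited fields.txt record into a dict."""
    head, *attributes = record.split("#")
    parsed = {"label": head.split(":", 1)[0].strip()}
    for attribute in attributes:
        key, _, value = attribute.partition(":")
        parsed[key.strip()] = value.strip()
    return parsed


def type_sequence(records, keywords=TYPE_KEYWORDS):
    """Return (integer sequence, unique types) for the message-type fields."""
    all_types, unique_types = [], []
    for record in records:
        parsed = parse_record(record)
        if parsed["label"] not in keywords:
            continue
        message_type = parsed.get("show", "unspecified")
        all_types.append(message_type)
        if message_type not in unique_types:
            unique_types.append(message_type)
    # Each observed type is replaced by its index in the unique list.
    return [unique_types.index(t) for t in all_types], unique_types


RECORDS = [
    "Type: 8 (Echo (ping) request): #size: 1 #show: 8 #pos: 34 #value: 08 #name: icmp.type",
    "Code: 0: #size: 1 #show: 0 #pos: 35 #value: 00 #name: icmp.code",
    "Type: 0 (Echo (ping) reply): #size: 1 #show: 0 #pos: 34 #value: 00 #name: icmp.type",
    "Type: 8 (Echo (ping) request): #size: 1 #show: 8 #pos: 34 #value: 08 #name: icmp.type",
]

sequence, unique_types = type_sequence(RECORDS)
print(sequence, unique_types)
```

For a request, reply, request exchange, the unique list is `['8', '0']`, so the integer sequence alternates between 0 and 1.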


2.2.4  "fieldVocab.py" 
The “fieldVocab.py” program uses the “fields.txt” output file from the “pdmlExtractor.py” tool to generate the vocabulary for the protocol. This tool looks at every field of each packet and stores the distinct values. Some field values never change and therefore have only one value stored in their vocabulary, while other fields may have a different value for every packet. This is done by creating a field object for every field in the packet. Figure 14 illustrates this process.


Fig.  14  Vocabulary  generation 

As shown in Fig. 14, the initial step is to create an array of field objects. Each field object contains a list that holds the vocabulary for that field. The value of every field of each packet in the “fields.txt” file is compared to the values that are already stored in the vocabulary list for that object. If the value does not exist in the list, it is added. If it exists, the value is ignored. This process is repeated for the entire “fields.txt” file. Once it is complete, the entropy of each vocabulary list is calculated. The calculated entropy describes the randomness of the vocabulary within a list. It is calculated from the number of values stored in the vocabulary list (i.e., the length of the list). For example, Fig. 14 shows that the “Type” object contains a list of 2 values: 0 and 8. This means that there are only 2 possible values for the “Type” field in the entire PCAP file. The entropy for this vocabulary is 0.5. The formula used to calculate entropy is

Entropy = (x - 1)/x, where x is the size of the vocabulary (1)

This formula ensures that the entropy is always less than 1. The larger the vocabulary, the closer the entropy is to 1. If there is only one item in the vocabulary, the entropy is 0. This means that the value is the same in every packet in the PCAP file. Once the entropy is calculated, the values are used to generate the standardized XML file for the model generation. Figure 15 shows an example of this file for the Ping protocol.
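The vocabulary lists and Eq. 1 can be illustrated with a short sketch. The class and function names are hypothetical, and the input is simplified to (field name, value) pairs rather than parsed “fields.txt” records.

```python
class Field:
    """Holds the distinct values (vocabulary) observed for one field."""

    def __init__(self, name):
        self.name = name
        self.vocabulary = []

    def observe(self, value):
        # A value is stored only the first time it is seen.
        if value not in self.vocabulary:
            self.vocabulary.append(value)

    def entropy(self):
        # Eq. 1: (x - 1)/x, where x is the size of the vocabulary.
        x = len(self.vocabulary)
        return (x - 1) / x if x else 0.0


def build_vocabulary(observations):
    """observations: iterable of (field name, value) pairs over all packets."""
    fields = {}
    for name, value in observations:
        if name not in fields:
            fields[name] = Field(name)
        fields[name].observe(value)
    return fields


OBSERVATIONS = [
    ("icmp.type", "08"), ("icmp.code", "00"), ("icmp.checksum", "6fe9"),
    ("icmp.type", "00"), ("icmp.code", "00"), ("icmp.checksum", "f4e2"),
    ("icmp.type", "08"), ("icmp.code", "00"), ("icmp.checksum", "b3da"),
]

fields = build_vocabulary(OBSERVATIONS)
for field in fields.values():
    print(field.name, field.vocabulary, field.entropy())
```

With 2 values the “Type” field scores 0.5, as in the text, and a field with 10 distinct values, such as the checksum in Fig. 15, scores (10 - 1)/10 = 0.9.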


<summerping>
  <mtype id = '0'>
    <mfield>
      <mname>icmp.type</mname>
      <mshowname>Type: 8 (Echo (ping) request)</mshowname>
      <msize>1</msize>
      <mpos>34</mpos>
      <mshow>8</mshow>
      <mvocab>08;</mvocab>
      <mentropy>0.0</mentropy>
    </mfield>
    <mfield>
      <mname>icmp.code</mname>
      <mshowname>Code: 0</mshowname>
      <msize>1</msize>
      <mpos>35</mpos>
      <mshow>0</mshow>
      <mvocab>00;</mvocab>
      <mentropy>0.0</mentropy>
    </mfield>
    <mfield>
      <mname>icmp.checksum</mname>
      <mshowname>Checksum: 0x6fe9 [correct]</mshowname>
      <msize>2</msize>
      <mpos>36</mpos>
      <mshow>0x6fe9</mshow>
      <mvocab>6fe9;f4e2;b3da;75d2;73cb;c8c5;9dc0;a7b8;95b1;55ac;</mvocab>
      <mentropy>0.9</mentropy>
    </mfield>
    <mfield>
      <mname>icmp.checksum_bad</mname>
      <mshowname>Bad Checksum: False</mshowname>
      <msize>2</msize>
      <mpos>36</mpos>
      <mshow>0</mshow>

Fig.  15  Model  standardized  XML  file 


The standardized file is structured by message type. In this Ping example, there are only 2 types of messages: 0 and 8. Each type has its own section, which encapsulates all of the fields that pertain to that type. Each field entry contains the name of the field along with its size, position, vocabulary, and entropy. This file is used later to generate the protocol model.

2.2.5  "fieldVocabToNs-3Model.py" 

This Python script takes in the model standardized XML file and generates the C++ model files needed by ns-3. The script executes 4 tasks to create the ns-3 model:

1) Extract all field attributes (i.e., names, sizes, vocabulary).

2) Generate C++ variables (using the names) along with their data types (using the sizes) from the data extracted in task 1.

3) Append initialization values for the variables (the first value in the vocabulary is used).

4) Append comments in the C++ code that indicate the entropy associated with each variable (i.e., based on its vocabulary).
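The four tasks above can be sketched for a single field as follows. The size-to-type table, the `m_` naming prefix, and the comment layout are assumptions for illustration and are not taken from the actual script. The initializer reuses the hexadecimal value exactly as captured, so byte order would still need attention for multi-byte fields.

```python
import re

# Assumed mapping from field size in bytes to a fixed-width C++ type.
CPP_TYPES = {1: "uint8_t", 2: "uint16_t", 4: "uint32_t", 8: "uint64_t"}


def cpp_declaration(name, size, vocabulary, entropy):
    """Return one C++ member declaration for a field, annotated with entropy."""
    # Replace characters that are not legal in a C++ identifier.
    identifier = "m_" + re.sub(r"\W", "_", name)
    cpp_type = CPP_TYPES.get(size)
    if cpp_type is None:
        # Fields such as the 48-byte payload do not fit a scalar type.
        return "// %s: size %s has no scalar type; handle manually" % (name, size)
    # The first vocabulary value is used as the initial value.
    return "%s %s = 0x%s;  // entropy: %s, vocabulary size: %d" % (
        cpp_type, identifier, vocabulary[0], entropy, len(vocabulary))


type_line = cpp_declaration("icmp.type", 1, ["08"], 0.0)
checksum_line = cpp_declaration(
    "icmp.checksum", 2,
    ["6fe9", "f4e2", "b3da", "75d2", "73cb", "c8c5", "9dc0", "a7b8", "95b1", "55ac"],
    0.9)
print(type_line)
print(checksum_line)
```

A high entropy in the generated comment flags a variable, such as the checksum, whose fixed initial value an analyst will probably need to replace with computed logic.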

2.2.6  Prospex 

Prospex was used for its State Machine Inferencing tool. This tool takes in the “packetTypeSequences.txt” file, which is described in Fig. 3. Refer to the Prospex Section 2.1.2.3 for further details. The program produces several files. The one of interest is the DOT file named “labeledStateMachine.dot”. This file is used to create the grammar (i.e., the state machine) for the protocol.


2.2.7  "dotToCppGrammar.py" 

This  module  will  read  in  the  “labeledStateMachine.dot”  file  that  is  generated  by  the 
State  Machine  Inferencing  tool  provided  by  Prospex.  The  format  in  which  this  file 
is  generated  makes  it  easy  to  translate  into  a  C++  file  that  is  needed  for  the  model 
generation.  Figure  16  shows  an  example  DOT  file  for  the  Ping  protocol. 


digraph message {
/* States */
node0 [label="8" , shape=doublecircle ];
node29 [label="0 8" , shape=circle ];
/* Transitions */
node0 -> node29 [label="8"];
node29 -> node0 [label="0"];
}


Fig.  16  Ping  “labeledStateMachine.dot”  file 


The  Fig.  16  example  shows  how  we  distinguished  states  and  transitions.  The 
“dotToCppGrammar.py”  program  takes  advantage  of  this  by  creating  State  objects 
in  Python  that  will  each  contain  their  own  transitions.  This  process  is  illustrated  in 
Fig.  17,  which  shows  each  state  and  the  list  of  transitions  stored  inside  the  state 
object.  The  outer  arrows  describe  the  transitions  that  are  included  in  the  objects. 


Fig.  17  State  objects  with  stored  transitions 

The characteristics of this state machine are all incorporated into the C++ code that is generated by this script. Included in this code is a function called “getNextState()”, which is used to determine the message type that the model needs to produce at a specific point during the simulation. The state object is responsible for keeping track of the protocol's current state and, when requested, returns the next possible states. The “getNextState()” method accepts an integer value that represents the message type the model is changing to, or receiving, and returns the possible types that the model can then change to, or reply with. This is all written in one C++ file generated by the “dotToCppGrammar.py” program (i.e., <protocolName>.cc).
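The state objects and the lookup can be sketched in Python as follows. This is a Python rendition of the idea only, not the generated C++ code: the regular expression, the `State` class, and the `get_next_state` signature are assumptions for illustration, and the sketch assumes that each state has at most one outgoing transition per message type.

```python
import re

# Matches edge lines such as: node0 -> node29 [label="8"];
EDGE = re.compile(r'(\w+)\s*->\s*(\w+)\s*\[label="([^"]*)"')


class State:
    def __init__(self, name):
        self.name = name
        self.transitions = {}  # message type -> name of the state it leads to


def parse_dot(dot_text):
    """Build State objects from the transition lines of a DOT file."""
    states = {}
    for source, target, label in EDGE.findall(dot_text):
        for name in (source, target):
            states.setdefault(name, State(name))
        states[source].transitions[int(label)] = target
    return states


def get_next_state(states, current, message_type):
    """Follow one transition; return the new state and the types allowed next."""
    new_state = states[current].transitions[message_type]
    return new_state, sorted(states[new_state].transitions)


PING_DOT = """digraph message {
node0 [label="8" , shape=doublecircle ];
node29 [label="0 8" , shape=circle ];
node0 -> node29 [label="8"];
node29 -> node0 [label="0"];
}"""

states = parse_dot(PING_DOT)
after_request = get_next_state(states, "node0", 8)
after_reply = get_next_state(states, "node29", 0)
print(after_request, after_reply)
```

For the Ping state machine, sending a type 8 message from `node0` leads to `node29`, from which only a type 0 reply is possible.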

2.3  Accuracy  Analysis 

For testing purposes, ns-3 generates a PCAP file when the simulation of the Ping protocol is run. This capture file is used to compare the packets that are produced by the simulation with the actual packets in the original PCAP. In addition, the packets are compared with those from existing ns-3 models available through ns-3 and other resources. This process shows firsthand the accuracy of the ns-3 models being produced by the model generator.

The experiments that we ran used a 10-second (s) PCAP capture taken with the Wireshark network protocol analyzer. The model generator script takes in the PCAP file, the name of the protocol that we want to simulate with ns-3, and the name we want to use for the model (see Fig. 18).
python ns-3ModelGenerator.py 10secs.pcap ICMP ICMPTest

Arguments, in order: PCAP file, protocol, model name

Fig.  18  ns-3  Model  Generator  script  execution  parameters 

The entropy, which is included in the model standardized XML file, is crucial for the accuracy of the protocol models. This is because some protocols might have a large vocabulary of possible values for some fields. If this is the case, the user must manually choose one of the values for that field, or choose to ignore the field completely. This can affect the fidelity of the models.


3.  Results 


This tool was tested using the ICMP Ping protocol. It created an accurate Ping ns-3 model and generated a simulation file that runs successfully on ns-3. The ns-3 simulation also creates a PCAP file, as described earlier, which we use to determine the accuracy of the model. By comparing the packets that are produced by our simulation with the actual packets in the original PCAP file, we find that the autogenerated model produces data very similar to the original.


Figure  19  shows  images  of  the  PCAP  files  (i.e.,  in  Wireshark).  The  original  10-s 
capture  is  the  ground  truth.  The  ns-3  sample  PCAP  contains  only  null  values  in  the 
data  field.  The  PCAP  file  generated  by  our  tool  contains  realistic  data  and  values 
that  match  the  data  field  from  the  10-s  capture.  In  the  generated  model,  several 
fields  had  low  entropy  (e.g.,  field  type,  code,  and  others)  and  some  had  high 
entropy.  High-entropy  fields  include  the  checksums,  timing,  and  sequence 
numbers.  While  human  intervention  will  be  required  to  more  closely  mimic  the 
original  protocol,  an  analyst  is  provided  with  an  annotated  template  as  a  starting 
point. 


Fig.  19  PCAP  accuracy  comparison 
4. Conclusions


We have created a model generator and demonstrated how it can be used to extract protocol fields, vocabulary, and state machine specifications for a Ping protocol. These features are crucial in the generation of accurate protocol models. By leveraging Tshark and Prospex, this tool can generate the ns-3 C++ files needed to run a simulation of a protocol from a PCAP file. Because protocol implementations vary widely, the entropy of the vocabulary needs to be calculated to determine the values included in the model. The model standardized XML file lists the possible vocabulary that can be chosen for each field.

Near-term improvements that need to be made include testing with more protocols (especially at different layers of the OSI model), implementing an inference engine to extract inter- and intrapacket dependencies, and adding the ability to generate models that can be used in other simulators and emulators as well as network scripting engines (e.g., scapy).
Appendix. Detailed Analysis of Protocol Reversing Tools

A-1 Setting Up the Automatic Semantics-Aware Analysis of Network Payloads (ASAP)/Protocol Inspection and State Machine Analysis (PRISMA)

1)  Download  PRISMA  from  https://github.com/tammok/PRISMA. 

2) Download Sally from http://www.mlsec.org/sally/.

3)  Install  the  latest  version  of  R. 

It  is  necessary  for  the  version  of  R  to  be  3.2  or  higher. 

“R”  Install  Instructions 

a) Add the following lines to /etc/apt/sources.list.

## R BACKPORTS FOR WHEEZY

deb http://cran.revolutionanalytics.com/bin/linux/debian wheezy-cran3/
#deb-src http://cran.revolutionanalytics.com/bin/linux/debian wheezy-cran3/

b)  Then  do  the  following  commands: 
apt-get  update 

apt-get  upgrade 

apt-get  install  r-base  r-base-dev 

4)  Run  “R”  and  enter  the  following  commands.  This  step  will  install  the 
necessary  libraries  into  R  for  running  ASAP/PRISMA. 

install.packages("PRISMA")

library("PRISMA")

There are some dependencies that might need to be installed manually due to incompatible versions. This can be done using the same command: install.packages("<NAMEOFDEPENDENCY>").

5)  Install  Sally. 

a)  Install  the  necessary  packages  as  listed  in  the  README.md  provided  in 
the  downloaded  Sally  directory.  To  do  so  enter  the  following 
commands: 

1. apt-get install libz-dev

2. apt-get install libconfig8-dev

3. apt-get install libarchive-dev

4. (if gcc isn't installed) apt-get install gcc

b)  Run  the  following  commands: 

1.  ./configure 

2.  make 

3.  make  check 

4.  make  install 

Sally  should  be  ready  to  run. 

Running  ASAP  on  a  sample  Payload. 

6)  From  here  the  steps  from  the  tutorial  should  be  followed.  The  steps  are  listed 
here  as  well. 

a) Extract asap.tar.gz, which is located under /PRISMA-master/inst/extdata/.

b)  Create  the  Sally  output  file  from  the  raw  data  provided  by  using  the 
following  steps. 

It  is  necessary  to  change  some  of  the  features  in  the  sally.cfg  file. 

To  do  this,  follow  these  steps: 

i.  Open  the  sally.cfg  file. 

ii. Under the feature configuration section, add the following line: granularity = "tokens".

iii. Under the same section, change the variable name “ngram_delim” to “token_delim” without changing its delimiter value.

iv.  Save  changes. 

Create the Sally file using the following command:
sally -c asap.cfg asap.raw asap.sally

c) To speed up the loading of the data in “R”, apply the “sallyPreprocessing.py” Python script with the following command:
python sallyPreprocessing.py asap.sally asap.fsally

d) Load the data into “R”.

i. Start “R” (enter “R” in the terminal).

ii. Load the data via: loadPrismaData("PRISMA").

Following  the  tutorial,  additional  information  is  provided  about  the  payload  that  is 
being  analyzed. 

Keywords: 

1)  Payload:  a  string  of  bytes  contained  in  network  communication. 

2) T-test: a statistical test of whether the difference between the means of 2 groups is unlikely to have occurred due to chance.

3)  Pearson  correlation  coefficient:  a  measure  of  the  linear  correlation 
between  2  variables  X  and  Y,  giving  a  value  between  +1  and  -1 
inclusive. 

Converting  to  XML: 

Tshark command to get the PDML transformation:

tshark -r "<.pcap_file>" -T pdml -E separators > <.pdml_file>

-r: read from file
-T: transform to specified format
-E: use delimiter
>: output to specified file

Command to list all the protocols that exist within the PCAP file:

tshark -r test.cap -z io,phs -q | tr -s ' ' | cut -f2 -d' ' | tail -n +7 | head -n -1

A-2  Netzob  Detailed  Evaluation 

We  evaluated  Netzob  by  running  through  the  example  provided  on  the  website 
tutorial.  We  were  able  to  complete  the  tutorial  using  the  sample  data  provided  on 
the  website  and  we  then  proceeded  to  rerun  the  tutorial  with  our  10-second  (s) 
ICMP  PCAP  file. 

The  following  are  the  steps  in  the  tutorial: 

•  Import  of  a  PCAP  file 
• Format message inference

o  Partitionment  of  messages  following  a  specific  delimiter 

o  Regroupment  of  messages  following  a  specific  key  field 

o  Partitionment  of  a  subset  of  each  message  following  a  sequence 
alignment 

o  Search  for  relationships  in  each  group  of  messages 

o  Modification  of  the  format  message  to  apply  found  relationships 

•  Grammar  inference 

o  Generation  of  an  automaton  with  one  main  state  according  to  a 
captured  sequence  of  messages 

o  Generation  of  an  automaton  with  a  sequence  of  states  according  to 
a  captured  sequence  of  messages 

o  Generation  of  a  Prefix  Tree  Acceptor  (PTA)  automaton  according 
to  a  captured  sequence  of  messages 

•  Traffic  generation  and  fuzzing 

o Generation of messages following the inferred message format of each group and through visiting the inferred automata

o  Fuzzing  of  an  implementation  by  generating  altered  message 
formats 

We tested with 3 versions of Netzob. First, we used the official stable release of the software (which allows use of the graphical user interface [GUI]). Second, we tried the GitHub branch labeled “next”, and third, the source code branch labeled “master”. Judging by the Git log, the developers removed the GUI functionality soon after the release of Netzob 0.4.1, probably because several modifications to the application programming interface (API) made the GUI no longer usable. We noticed that a developer by the name of Georges was actively updating both branches of the source code (different files in each branch).

While running Netzob, the first problem occurred when importing the PCAP file from the tutorial. We had to modify the call to “PCAPImporter.readFile("target_src_v1_session1.pcap").values()” by adding a parameter specifying that data only up to network Layer 3 (i.e., the Internet Protocol [IP] layer) should be parsed. After fixing the “PCAPImporter” call, the tutorial code ran all the way through to completion, but the results were incorrect. In the next step of the tutorial, the authors partition the messages in the sample PCAP using a “#” delimiter. This character is used to separate the commands in the command and control (C2) traffic. In our PCAP file (i.e., the 10-s live Ping capture), this character was present, but it was not meant to be a delimiter. We looked into other ways to partition the data.

There are several ways that one can parse symbols (or packet types). The tutorial uses sequence alignment, but with Ping (in this particular case, with the loopback traffic) the fields were parsed out most correctly when using the simple partitioning technique, which separates data based on the dynamicity of bytes (i.e., whether they change across messages). Looking more closely at the source code, we identified the location of the partitioning functions: “src/netzob/Inference/Vocabulary/Format.py”.

In the code, the name of the module is “Format” and the function is called “splitStatic”. It is possible to call this function with 4 parameters: the symbol array, “unitSize”, “mergeAdjacentStaticFields”, and “mergeAdjacentDynamicFields”. We obtained the best results specifying only the unitSize. We also tried splitting the input into 8-bit segments and then using merge to create fields from the ICMP Ping specification.16 While this did work for display purposes (i.e., splitting fields into 8-bit segments), the merged field types became aggregate instead of raw, causing incompatibility with the other algorithms in Netzob (e.g., the clustering algorithm no longer worked on the data). We proceeded by using the splitStatic function with only the symbol array as an input. We originally thought that we could create a converter module that would take a protocol specification as input and then use Netzob to parse symbols based on that specification. However, we were unsuccessful in finding a nontrivial way to do this.

The next step in the tutorial uses the “clusterByKeyField” function in the Format module (the actual file is located at src/netzob/Inference/Vocabulary/FormatOperations/ClusterByKeyField.py). This function uses the values in the specified fields to cluster message types. When we ran this with Field 5 (i.e., the ICMP type field, which contained 2 values: \x00 and \x08), Netzob (the next branch) would crash, stating that the last field in the message was not aligned correctly (i.e., by the ParallelDataAlignment.py module). We modified the code to use the DataAlignment module instead. We then isolated the problem in “ClusterByKeyField.py”: it was using the “\x00” character (i.e., valid ASCII) to infer the type of the field as ASCII. When “\x08” was read, the character was treated as invalid (i.e., not a valid ASCII character) and Netzob would crash. We modified the code to force the type to HexaString, which fixed the problem.
The splitStatic algorithm worked well with this particular instance of the Ping traffic (which was loopback traffic), but it did not work the same way when using a capture of pings to a google.com server. This was strange because the Netzob 0.4.1 stable release gave correct results.

We  executed  the  next  step  in  the  tutorial  (i.e.,  search  for  relationships  in  each  group 
of  messages),  but  this  produced  no  relationships. 

Afterwards, we used the Automata module to generate state machines using different functions: “generateChainedStateAutomata”, “generateOneStateAutomata”, and “generatePTAAutomata”. The “generateChainedStateAutomata” function generates all possible states for each unique transition. The “generateOneStateAutomata” function generates a universal receiver (i.e., one that accepts all traffic as valid input). Regardless of what is received, it will respond. The “generatePTAAutomata” function takes as input several communication sessions, identifies common paths, and merges these into a single automaton. The resulting state machines consisted of only 3 states: start state, open channel, and close channel (i.e., end state). To generate an automaton that can work with real traffic, the user must pass real traffic into Netzob. Only then will a usable grammar (i.e., state machine) be generated.
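The idea of merging common paths into a Prefix Tree Acceptor can be illustrated with the textbook construction below. This is a generic sketch, not Netzob's implementation: the function name `build_pta`, the integer state numbering, and the representation of sessions as lists of message types are assumptions for the example.

```python
def build_pta(sessions):
    """Build a prefix tree acceptor from sequences of message types.

    Returns {state: {symbol: next_state}}, where state 0 is the start state.
    Sessions that share a prefix share the corresponding states.
    """
    transitions = {0: {}}
    for session in sessions:
        state = 0
        for symbol in session:
            if symbol not in transitions[state]:
                # First time this prefix is seen: allocate a new state.
                new_state = len(transitions)
                transitions[state][symbol] = new_state
                transitions[new_state] = {}
            state = transitions[state][symbol]
    return transitions


# Two hypothetical Ping sessions: the second is a prefix of the first.
shared = build_pta([[8, 0, 8, 0], [8, 0, 8]])
# Two sessions that diverge after the first message.
branching = build_pta([[8, 0], [8, 8]])
print(shared)
print(branching)
```

Sessions that begin the same way reuse the same states, so the tree grows only where the captured sessions diverge.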

The  final  steps  in  the  tutorial  create  a  traffic  generator  and  a  fuzzer.  Because  the 
previous  steps  did  not  yield  successful  results,  we  were  not  able  to  complete  these. 
List of Symbols, Abbreviations, and Acronyms

API       Application Programming Interface
ASAP      Automatic Semantics-Aware Analysis of Network Payloads
BRITE     Boston University Representative Internet Topology Generator
C2        command and control
DNS       Domain Name System
FTP       File Transfer Protocol
GUI       Graphical User Interface
HTTP      Hypertext Transfer Protocol
ICMP      Internet Control Message Protocol
IP        Internet Protocol
NFS       Network File System
NIC       Network Interface Card
OLSR      Optimized Link State Routing
OSPF      Open Shortest Path First
OSI       Open Systems Interconnection
PDML      Packet Details Markup Language
PRISMA    Protocol Inspection and State Machine Analysis
PTA       Prefix Tree Acceptor
SMB       Server Message Block
SMTP      Simple Mail Transfer Protocol